Assignment

In this homework assignment, you will explore, analyze and model a data set containing information on approximately 12,000 commercially available wines. The variables are mostly related to the chemical properties of the wine being sold. The response variable is the number of sample cases of wine that were purchased by wine distribution companies after sampling a wine. Your objective is to build a count regression model to predict the number of cases of wine that will be sold given certain properties of the wine.

We’ll build two Poisson regressions, two negative binomial regressions, and two multiple linear regression models.


Data Exploration

In order to explore summary stats and distribution characteristics of our dataset, we’ll need to first conduct some basic transformations and cleanup:

  • The provided dataset contains a single response variable target, a numeric variable indicating the number of cases purchased.
  • The ‘evaluation’ dataset contains no values for target, suggesting this data is meant for prediction rather than for validation and evaluation of model performance. For clarity we’ll rename this dataset ‘prediction’ and create a separate validation hold-out from the training data.
  • There is a numeric index column labeling the observations which can be excluded from the models.
  • 3335 observations (or 21%) of the total dataset have been set aside for prediction.
  • The combined training and prediction datasets consist of 16130 observations containing 14 predictor variables:
variable            complete_rate  n_missing      min      max
acidindex                    1.00          0     4.00    17.00
alcohol                      0.95        838    -4.70    26.50
chlorides                    0.95        776    -1.17     1.35
citricacid                   1.00          0    -3.24     3.86
density                      1.00          0     0.89     1.10
fixedacidity                 1.00          0   -18.20    34.40
freesulfurdioxide            0.95        799  -563.00   623.00
labelappeal                  1.00          0    -2.00     2.00
ph                           0.97        499     0.48     6.21
residualsugar                0.95        784  -128.30   145.40
stars                        0.74       4200     1.00     4.00
sulphates                    0.91       1520    -3.13     4.24
totalsulfurdioxide           0.95        839  -823.00  1057.00
volatileacidity              1.00          0    -2.83     3.68
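
The combination step described in the bullets above can be sketched in base R. This is a minimal illustration, not the code used for the report: the toy data frames, their values and the source_flag labels are assumptions made for the example.

```r
# Toy stand-ins for the provided training and evaluation files (values are made up).
train_raw <- data.frame(index = 1:4, target = c(3, 0, 4, 2),
                        alcohol = c(9.9, NA, 12.4, 10.1))
eval_raw  <- data.frame(index = 5:6, target = NA_real_,
                        alcohol = c(11.0, NA))

# Tag each row with its origin so the two sets can be separated again later.
train_raw$source_flag <- "train"
eval_raw$source_flag  <- "prediction"

# Stack both sets and drop the index column, which only labels the rows.
df_all <- rbind(train_raw, eval_raw)
df_all$index <- NULL

# Share of rows reserved for prediction.
pct_prediction <- mean(df_all$source_flag == "prediction")
print(round(pct_prediction, 2))
```

Working on the stacked data means every cleanup step is applied identically to the rows we train on and the rows we later predict.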

Definitions

  • AcidIndex: Proprietary method of testing total acidity of wine by using a weighted average
  • Alcohol: Alcohol Content
  • Chlorides: Chloride content of wine
  • CitricAcid: Citric Acid Content
  • Density: Density of Wine
  • FixedAcidity: Fixed Acidity of Wine
  • FreeSulfurDioxide: Sulfur Dioxide content of wine
  • LabelAppeal: Marketing Score indicating the appeal of label design for consumers. High numbers suggest customers like the label design. Negative numbers suggest customers don’t like the design.
  • ResidualSugar: Residual Sugar of wine
  • Stars: Wine rating by a team of experts. 4 Stars = Excellent, 1 Star = Poor. A high number of stars suggests high sales
  • Sulphates: Sulfate content of wine
  • TotalSulfurDioxide: Total Sulfur Dioxide of Wine
  • VolatileAcidity: Volatile Acid content of wine
  • pH: pH of wine

Transformed Variables?

One of the first characteristics that stand out is the presence of negative values for many chemical compounds, and the relative normality of their distributions. This suggests they have already been power-transformed to produce normal distributions for modeling.

Variables related to sugars, chlorides, acidity, sulfites and sulfates all seem to fall in this category. Considering that we are analyzing very tiny amounts of chemical compounds, we would expect their natural distributions to be highly skewed.

We tried back-transforming these variables by exponentiation (base e and other bases), but did not arrive at an obvious or consistent transformation approach, so we may not be able to interpret model results on the scale of the original values for these variables.


Handling Missing Data

Next we’ll find and impute any missing data. There are 8 predictor variables that contain NAs:

variable            is_na   pct
stars                4200  0.26
sulphates            1520  0.09
residualsugar         784  0.05
chlorides             776  0.05
freesulfurdioxide     799  0.05
totalsulfurdioxide    839  0.05
alcohol               838  0.05
ph                    499  0.03
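
A table like the one above can be produced with a few lines of base R. The sketch below uses a small made-up data frame; the column names is_na and pct follow the table, everything else is assumed for illustration.

```r
# Toy data frame; in the report this would be the combined dataset.
df <- data.frame(stars   = c(1, NA, 3, NA),
                 ph      = c(3.2, 3.4, NA, 3.1),
                 density = c(0.99, 1.00, 0.98, 0.99))

# Count NAs per column and express them as a share of all rows.
is_na <- colSums(is.na(df))
na_summary <- data.frame(variable = names(is_na),
                         is_na    = unname(is_na),
                         pct      = round(unname(is_na) / nrow(df), 2))

# Keep only columns with at least one NA, most affected first.
na_summary <- na_summary[na_summary$is_na > 0, ]
na_summary <- na_summary[order(-na_summary$is_na), ]
print(na_summary)
```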

Heeding the warning in the assignment, “sometimes, the fact that a variable is missing is actually predictive of the target”, we’ll consider each of these variables carefully. While there may be data “missing completely at random” (MCAR) that we wish to impute, this may not always be the case.

Missing Data - Stars

The predictor Stars suggests that out of roughly 16,000 wine samples, about 26% have never been professionally reviewed. If we assume that the existence of a review has some impact on the sales of a wine brand (whatever the reviewer’s sentiment), then imputing mean or predicted values here might distort our model.

To enable further analysis we’ll convert stars from a numeric to a factor, with a level ‘0’ representing our NA values.
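
As a minimal sketch of that conversion (toy values, hypothetical variable names):

```r
stars <- c(2, NA, 4, 1, NA, 3)   # toy ratings; NA = not reviewed

# Recode NA as 0, then treat the rating as categorical so that
# "no review" gets its own coefficient instead of an imputed value.
stars_recoded <- ifelse(is.na(stars), 0, stars)
stars_factor  <- factor(stars_recoded, levels = 0:4)

print(table(stars_factor))
```

Because ‘0’ is the first level, it becomes the reference category, so the stars1 to stars4 coefficients in the model summaries are read relative to unreviewed wines.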

Missing Data - Chemical Compounds

Next we consider the missing values for chemical compounds in our wines (alcohol, sugars, chlorides, sulfites and sulfates) and for measures such as ph.

First, we can safely assume that all wines in this dataset have an actual ph score greater than zero (a score of zero would represent the most acidic rank, such as powerful industrial acids). We’ll want to impute reasonable values for the missing ph scores.

Based on some reading into the organic wines segment, there is a growing demand in the market for specialty products such as low-sulfite, low-sugar and low-alcohol wines. However, this still represents a very small segment of the overall market, and chemically it’s not likely for these compounds to be completely absent from the final product.

Additionally, the predictors freesulfurdioxide and totalsulfurdioxide are linked: the amount of ‘Free’ SO2 in wine is always a subset of the ‘Total’ SO2 present. We only identified 59 cases where both these values were NA, while over 1500 cases had missing values for only one or the other.
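
The overlap check can be sketched as follows, using made-up vectors in place of the two columns:

```r
# Toy values; NA marks a missing measurement.
free_so2  <- c(30, NA, 15, NA, 22, NA)
total_so2 <- c(120, 95, NA, NA, 80, 60)

# Rows where both are missing, and rows where exactly one is missing.
both_missing <- sum(is.na(free_so2) & is.na(total_so2))
one_missing  <- sum(xor(is.na(free_so2), is.na(total_so2)))

cat("both missing:", both_missing, " exactly one missing:", one_missing, "\n")
```

With the counts reported earlier (799 and 839 NAs, 59 of them shared), the number of rows missing exactly one of the two values works out to 799 + 839 - 2 × 59 = 1520.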

Based on these observations, we’ll use the MICE imputation method to predict and impute the missing values for residualsugar, chlorides, freesulfurdioxide, totalsulfurdioxide, sulphates, alcohol and ph.

Target/source labels and non-chemical predictors labelappeal and stars were excluded as predictors for the imputation.
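
A hedged sketch of what such an imputation call might look like with the mice package. The simulated data, the choice of predictive mean matching ("pmm"), a single imputed dataset (m = 1) and the seed are all assumptions for illustration; the report does not state which settings were used. The block skips the imputation if mice is not installed.

```r
set.seed(42)
n <- 200
# Simulated stand-in for a few of the wine columns.
toy <- data.frame(
  alcohol       = rnorm(n, 10.5, 3),
  ph            = rnorm(n, 3.2, 0.6),
  residualsugar = rnorm(n, 5, 30),
  stars         = factor(sample(0:4, n, replace = TRUE))
)
# Knock out some values at random to mimic the missing measurements.
toy$alcohol[sample(n, 20)] <- NA
toy$ph[sample(n, 15)]      <- NA

if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)

  # Columns of the predictor matrix are predictors; zeroing a column
  # removes that variable from every imputation model.
  pred <- make.predictorMatrix(toy)
  pred[, "stars"] <- 0

  imp <- mice(toy, m = 1, method = "pmm", predictorMatrix = pred,
              seed = 42, printFlag = FALSE)
  toy_imputed <- complete(imp, 1)
  print(colSums(is.na(toy_imputed)))
}
```

Zeroing a column in the predictor matrix is one way to keep non-chemical variables out of the imputation models while leaving them in the data.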

Data Sparseness - Label Appeal

labelappeal is a numeric score of consumer ratings for a wine brand’s label design. It has also been pre-transformed to produce a normal distribution for modeling; however, this is a very sparse variable, with nearly half the cases (7,087 of 16,130) having a value of zero.

This may be a candidate for handling with Zero-Inflated models. We won’t change the values here, but will convert labelappeal from a numeric to a factor.


Examine Final Dataset

We now have reasonably imputed values, and nearly-normal distributions for our numeric predictors, taking special note of the frequency of zero values for labelappeal and stars.

variable            n_missing  n_zero
acidindex                   0       0
alcohol                     0       5
chlorides                   0       8
citricacid                  0     151
density                     0       0
fixedacidity                0      47
freesulfurdioxide           0      12
labelappeal                 0    7087
ph                          0       0
residualsugar               0       6
stars                       0    4200
sulphates                   0      29
totalsulfurdioxide          0      12
volatileacidity             0      22

Split Datasets

With transformations complete, we split back into training and prediction datasets based on our source_flag, and create a 15% validation hold-out from the training data.
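
The split can be sketched in base R as below. The toy data, the flag values and the seed are assumptions; only the source_flag column and the 15% hold-out come from the text.

```r
set.seed(123)  # assumed seed, for reproducibility only

# Toy combined dataset with a source_flag column.
df_all <- data.frame(target      = c(rpois(100, 3), rep(NA, 20)),
                     alcohol     = rnorm(120, 10.5, 3),
                     source_flag = rep(c("train", "prediction"), c(100, 20)))

df_model   <- df_all[df_all$source_flag == "train", ]
df_predict <- df_all[df_all$source_flag == "prediction", ]

# Hold out 15% of the training rows for validation.
valid_idx <- sample(nrow(df_model), size = round(0.15 * nrow(df_model)))
df_valid  <- df_model[valid_idx, ]
df_train  <- df_model[-valid_idx, ]

cat(nrow(df_train), "train /", nrow(df_valid), "validation /",
    nrow(df_predict), "prediction rows\n")
```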


Build Models

Poisson Regression 1

Poisson Regression assumes that the variance and mean of our dependent variable target are roughly equal (conditional on the predictors); otherwise we may be looking at over- or under-dispersion.

pr1 <- glm(target ~ ., family = 'poisson', data = df_train)
## 
## Call:
## glm(formula = target ~ ., family = "poisson", data = df_train)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         6.607e-01  2.164e-01   3.053 0.002268 ** 
## fixedacidity       -3.809e-04  8.856e-04  -0.430 0.667148    
## volatileacidity    -3.387e-02  7.093e-03  -4.776 1.79e-06 ***
## citricacid          5.441e-03  6.408e-03   0.849 0.395807    
## residualsugar       7.920e-05  1.632e-04   0.485 0.627481    
## chlorides          -3.139e-02  1.735e-02  -1.810 0.070361 .  
## freesulfurdioxide   8.966e-05  3.701e-05   2.422 0.015418 *  
## totalsulfurdioxide  8.205e-05  2.422e-05   3.388 0.000704 ***
## density            -2.293e-01  2.084e-01  -1.101 0.271049    
## ph                 -1.128e-02  8.190e-03  -1.377 0.168366    
## sulphates          -6.587e-03  5.906e-03  -1.115 0.264669    
## alcohol             3.583e-03  1.500e-03   2.388 0.016928 *  
## labelappeal-1       2.206e-01  4.128e-02   5.345 9.05e-08 ***
## labelappeal0        4.158e-01  4.021e-02  10.341  < 2e-16 ***
## labelappeal1        5.450e-01  4.090e-02  13.325  < 2e-16 ***
## labelappeal2        6.927e-01  4.609e-02  15.030  < 2e-16 ***
## acidindex          -7.994e-02  4.996e-03 -16.001  < 2e-16 ***
## stars1              7.850e-01  2.123e-02  36.981  < 2e-16 ***
## stars2              1.096e+00  1.984e-02  55.225  < 2e-16 ***
## stars3              1.215e+00  2.088e-02  58.193  < 2e-16 ***
## stars4              1.330e+00  2.633e-02  50.523  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 19379  on 10831  degrees of freedom
## Residual deviance: 11477  on 10811  degrees of freedom
## AIC: 38540
## 
## Number of Fisher Scoring iterations: 6
metric      value
AIC         38539.67
Dispersion  0.88
Log-Lik     -19248.84

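We note that our model has generated ‘dummies’ from our categorical variables labelappeal and stars. Of the 20 resulting model terms (excluding the intercept), 13 are statistically significant at the 0.05 level; fixedacidity, citricacid, residualsugar, chlorides, density, ph and sulphates are not.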

Notably, our Dispersion Parameter is 0.88, which suggests a degree of under-dispersion in the data.
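
One common way to estimate a dispersion parameter for a Poisson GLM is the Pearson chi-square statistic divided by the residual degrees of freedom. The sketch below assumes that definition (the report does not say how its value was computed) and uses simulated data rather than the wine data.

```r
set.seed(1)
# Simulated counts that really are Poisson, so dispersion should be near 1.
x <- rnorm(2000)
y <- rpois(2000, lambda = exp(0.5 + 0.3 * x))
fit <- glm(y ~ x, family = poisson)

# Pearson chi-square divided by residual degrees of freedom.
dispersion <- sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
print(round(dispersion, 2))
```

Values well above 1 point to over-dispersion and values below 1 to under-dispersion.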

Diagnostics

By graphing our target values (green) against our predicted values (blue) we can easily see this model tends to under-predict the higher count levels, and wildly over-predict the lower count levels.
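
A comparison of this kind can be sketched by tabulating rounded predictions next to the observed counts. The simulated data and the bar-chart layout are assumptions; the report’s actual plotting code is not shown.

```r
set.seed(2)
x <- rnorm(1000)
y <- rpois(1000, lambda = exp(1 + 0.4 * x))
fit <- glm(y ~ x, family = poisson)

# Round fitted means to whole cases so they can be tabulated like the target.
pred <- round(predict(fit, type = "response"))

# Put both frequency tables on a common set of count levels.
lv <- 0:max(y, pred)
freq <- rbind(actual    = table(factor(y, levels = lv)),
              predicted = table(factor(pred, levels = lv)))
print(freq)

# Uncomment to draw side-by-side bars (green = actual, blue = predicted):
# barplot(freq, beside = TRUE, col = c("green", "blue"), legend.text = TRUE)
```

Rounded fitted means cluster around the centre of the distribution, which is one reason a plain Poisson fit can look poor at the extremes in such a chart.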


Poisson Regression 2

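We’ll build a Zero-Inflated Poisson model, prompted by the large number of zero values in our labelappeal and stars predictors, to see if we can improve model accuracy. Note that the zero-inflation component models excess zeros in the response target, using these predictors among others.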

library(pscl)  # provides zeroinfl()
pr2 <- zeroinfl(target ~ . | ., data=df_train, dist = 'poisson')
## 
## Call:
## zeroinfl(formula = target ~ . | ., data = df_train, dist = "poisson")
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.275897 -0.428242 -0.001392  0.382528  5.195474 
## 
## Count model coefficients (poisson with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.205e-01  2.238e-01   3.219 0.001288 ** 
## fixedacidity        1.885e-04  9.089e-04   0.207 0.835675    
## volatileacidity    -1.190e-02  7.304e-03  -1.629 0.103221    
## citricacid          2.826e-03  6.529e-03   0.433 0.665133    
## residualsugar      -6.728e-05  1.671e-04  -0.403 0.687225    
## chlorides          -2.201e-02  1.782e-02  -1.235 0.216706    
## freesulfurdioxide   1.554e-05  3.723e-05   0.417 0.676474    
## totalsulfurdioxide -1.600e-05  2.418e-05  -0.662 0.508230    
## density            -2.167e-01  2.149e-01  -1.008 0.313227    
## ph                  4.417e-03  8.386e-03   0.527 0.598379    
## sulphates           1.815e-03  6.061e-03   0.299 0.764636    
## alcohol             6.170e-03  1.537e-03   4.016 5.93e-05 ***
## labelappeal-1       4.351e-01  4.489e-02   9.694  < 2e-16 ***
## labelappeal0        7.245e-01  4.383e-02  16.532  < 2e-16 ***
## labelappeal1        9.131e-01  4.456e-02  20.491  < 2e-16 ***
## labelappeal2        1.071e+00  4.947e-02  21.647  < 2e-16 ***
## acidindex          -1.931e-02  5.336e-03  -3.619 0.000295 ***
## stars1              6.819e-02  2.299e-02   2.967 0.003010 ** 
## stars2              1.888e-01  2.152e-02   8.772  < 2e-16 ***
## stars3              2.870e-01  2.252e-02  12.742  < 2e-16 ***
## stars4              3.833e-01  2.777e-02  13.804  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -6.808e+00  1.517e+00  -4.486 7.24e-06 ***
## fixedacidity        3.174e-03  5.903e-03   0.538 0.590865    
## volatileacidity     2.159e-01  4.783e-02   4.513 6.38e-06 ***
## citricacid         -1.453e-02  4.351e-02  -0.334 0.738408    
## residualsugar      -1.410e-03  1.096e-03  -1.286 0.198283    
## chlorides           2.097e-02  1.168e-01   0.180 0.857473    
## freesulfurdioxide  -8.157e-04  2.531e-04  -3.223 0.001270 ** 
## totalsulfurdioxide -1.010e-03  1.629e-04  -6.201 5.60e-10 ***
## density             6.806e-01  1.425e+00   0.478 0.632943    
## ph                  1.955e-01  5.412e-02   3.613 0.000303 ***
## sulphates           8.649e-02  3.953e-02   2.188 0.028671 *  
## alcohol             1.860e-02  1.009e-02   1.844 0.065253 .  
## labelappeal-1       1.646e+00  3.857e-01   4.266 1.99e-05 ***
## labelappeal0        2.365e+00  3.833e-01   6.169 6.87e-10 ***
## labelappeal1        3.088e+00  3.890e-01   7.938 2.06e-15 ***
## labelappeal2        3.419e+00  4.426e-01   7.725 1.12e-14 ***
## acidindex           4.290e-01  2.869e-02  14.952  < 2e-16 ***
## stars1             -2.135e+00  8.357e-02 -25.548  < 2e-16 ***
## stars2             -5.679e+00  3.388e-01 -16.764  < 2e-16 ***
## stars3             -2.025e+01  3.693e+02  -0.055 0.956273    
## stars4             -2.036e+01  6.933e+02  -0.029 0.976572    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 47 
## Log-likelihood: -1.719e+04 on 42 Df
metric      value
AIC         34471.98
Dispersion  0.45
Log-Lik     -17193.99

Diagnostics

Using a Zero-Inflated model, the Dispersion Parameter drops significantly (from 0.88 to 0.45), but we are getting a better overall result for counts of 3 or more, and the AIC falls from 38539.67 to 34471.98. By graphing our target values (green) against our predicted values (blue) we can see much greater accuracy for most of the mid- and upper counts.

Notably, we are still under-predicting counts of 1-2, and greatly over-predicting counts of zero.
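
For intuition about what a zero-inflated Poisson fits, the sketch below writes out its probability mass function: a point mass at zero mixed with an ordinary Poisson. The function name dzip and the parameter values are hypothetical and chosen only for illustration.

```r
# Zero-inflated Poisson probability mass function.
# pi_zero: probability of a structural zero; lambda: Poisson mean.
dzip <- function(k, lambda, pi_zero) {
  ifelse(k == 0,
         pi_zero + (1 - pi_zero) * dpois(0, lambda),
         (1 - pi_zero) * dpois(k, lambda))
}

k <- 0:8
p_zip  <- dzip(k, lambda = 4, pi_zero = 0.2)   # illustrative values
p_pois <- dpois(k, lambda = 4)

print(round(rbind(count = k, zip = p_zip, poisson = p_pois), 3))

# The mean of the mixture is (1 - pi_zero) * lambda.
zip_mean <- (1 - 0.2) * 4
print(zip_mean)
```

In the fitted model, the logit part of the summary drives pi_zero and the count part drives lambda, each as a function of the predictors.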


Negative Binomial Regression 1

Generally, we would use Negative Binomial Regression in cases of over-dispersion (where the variance of our dependent variable is significantly greater than the mean). This does not appear to be the case with our dataset, but we’ll apply it here and examine the results:

library(MASS)  # provides glm.nb() and stepAIC()
nb1 <- glm.nb(target ~ ., data = df_train)
## 
## Call:
## glm.nb(formula = target ~ ., data = df_train, init.theta = 40652.27005, 
##     link = log)
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         6.607e-01  2.164e-01   3.053 0.002268 ** 
## fixedacidity       -3.809e-04  8.857e-04  -0.430 0.667150    
## volatileacidity    -3.388e-02  7.094e-03  -4.776 1.79e-06 ***
## citricacid          5.441e-03  6.408e-03   0.849 0.395823    
## residualsugar       7.921e-05  1.632e-04   0.485 0.627468    
## chlorides          -3.139e-02  1.735e-02  -1.810 0.070366 .  
## freesulfurdioxide   8.967e-05  3.702e-05   2.422 0.015419 *  
## totalsulfurdioxide  8.206e-05  2.422e-05   3.388 0.000704 ***
## density            -2.293e-01  2.084e-01  -1.101 0.271059    
## ph                 -1.128e-02  8.191e-03  -1.378 0.168352    
## sulphates          -6.588e-03  5.906e-03  -1.115 0.264662    
## alcohol             3.583e-03  1.500e-03   2.388 0.016935 *  
## labelappeal-1       2.206e-01  4.128e-02   5.345 9.06e-08 ***
## labelappeal0        4.158e-01  4.021e-02  10.340  < 2e-16 ***
## labelappeal1        5.450e-01  4.090e-02  13.324  < 2e-16 ***
## labelappeal2        6.927e-01  4.609e-02  15.030  < 2e-16 ***
## acidindex          -7.994e-02  4.996e-03 -16.001  < 2e-16 ***
## stars1              7.850e-01  2.123e-02  36.980  < 2e-16 ***
## stars2              1.096e+00  1.984e-02  55.224  < 2e-16 ***
## stars3              1.215e+00  2.088e-02  58.191  < 2e-16 ***
## stars4              1.330e+00  2.633e-02  50.521  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(40652.27) family taken to be 1)
## 
##     Null deviance: 19378  on 10831  degrees of freedom
## Residual deviance: 11477  on 10811  degrees of freedom
## AIC: 38542
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  40652 
##           Std. Err.:  36846 
## Warning while fitting theta: iteration limit reached 
## 
##  2 x log-likelihood:  -38498.03
metric      value
AIC         38542.03
Dispersion  0.88
Log-Lik     -19249.02

Diagnostics

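As expected, the Negative Binomial Regression does not outperform the Poisson. The estimated theta is very large (about 40,652, with the iteration limit reached), so the model is effectively a Poisson fit: the coefficients are nearly identical and the AIC is slightly higher (38542.03 vs. 38539.67) because of the extra parameter.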
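
The negative binomial variance is mu + mu^2 / theta, so a very large theta leaves almost no room for extra variance. The short check below uses the theta reported in the summary above and an illustrative mean of 3 (an assumed value) to show how close the two distributions become.

```r
theta <- 40652.27   # theta reported in the glm.nb summary
mu    <- 3          # illustrative mean count (assumed)

# Negative binomial variance: mu + mu^2 / theta.
nb_var <- mu + mu^2 / theta
print(nb_var)

# Largest gap between the two probability mass functions over counts 0..20.
k <- 0:20
max_diff <- max(abs(dnbinom(k, size = theta, mu = mu) - dpois(k, lambda = mu)))
print(max_diff)
```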


Negative Binomial Regression 2

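We’ll build a Zero-Inflated Negative Binomial model, again prompted by the large number of zero values in our labelappeal and stars predictors, to see if we can improve model accuracy. As with the Zero-Inflated Poisson, the zero-inflation component models excess zeros in the response target.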

nb2 <- zeroinfl(target ~ . | ., data=df_train, dist = 'negbin')
## 
## Call:
## zeroinfl(formula = target ~ . | ., data = df_train, dist = "negbin")
## 
## Pearson residuals:
##       Min        1Q    Median        3Q       Max 
## -2.275890 -0.428238 -0.001389  0.382527  5.195515 
## 
## Count model coefficients (negbin with log link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         7.204e-01  2.238e-01   3.219 0.001288 ** 
## fixedacidity        1.885e-04  9.089e-04   0.207 0.835670    
## volatileacidity    -1.190e-02  7.304e-03  -1.629 0.103225    
## citricacid          2.826e-03  6.529e-03   0.433 0.665139    
## residualsugar      -6.728e-05  1.671e-04  -0.403 0.687221    
## chlorides          -2.201e-02  1.782e-02  -1.235 0.216708    
## freesulfurdioxide   1.553e-05  3.723e-05   0.417 0.676476    
## totalsulfurdioxide -1.599e-05  2.418e-05  -0.662 0.508290    
## density            -2.167e-01  2.149e-01  -1.008 0.313280    
## ph                  4.418e-03  8.386e-03   0.527 0.598367    
## sulphates           1.815e-03  6.061e-03   0.299 0.764627    
## alcohol             6.170e-03  1.537e-03   4.016 5.93e-05 ***
## labelappeal-1       4.351e-01  4.489e-02   9.694  < 2e-16 ***
## labelappeal0        7.245e-01  4.383e-02  16.532  < 2e-16 ***
## labelappeal1        9.131e-01  4.456e-02  20.492  < 2e-16 ***
## labelappeal2        1.071e+00  4.947e-02  21.647  < 2e-16 ***
## acidindex          -1.931e-02  5.336e-03  -3.619 0.000295 ***
## stars1              6.819e-02  2.299e-02   2.967 0.003010 ** 
## stars2              1.888e-01  2.152e-02   8.772  < 2e-16 ***
## stars3              2.870e-01  2.252e-02  12.742  < 2e-16 ***
## stars4              3.833e-01  2.777e-02  13.804  < 2e-16 ***
## Log(theta)          1.746e+01        NaN     NaN      NaN    
## 
## Zero-inflation model coefficients (binomial with logit link):
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -6.808e+00  1.517e+00  -4.486 7.24e-06 ***
## fixedacidity        3.174e-03  5.903e-03   0.538 0.590857    
## volatileacidity     2.159e-01  4.783e-02   4.514 6.38e-06 ***
## citricacid         -1.453e-02  4.351e-02  -0.334 0.738496    
## residualsugar      -1.410e-03  1.096e-03  -1.286 0.198279    
## chlorides           2.097e-02  1.168e-01   0.180 0.857452    
## freesulfurdioxide  -8.158e-04  2.531e-04  -3.223 0.001270 ** 
## totalsulfurdioxide -1.010e-03  1.629e-04  -6.201 5.60e-10 ***
## density             6.807e-01  1.425e+00   0.478 0.632890    
## ph                  1.955e-01  5.412e-02   3.613 0.000303 ***
## sulphates           8.649e-02  3.953e-02   2.188 0.028665 *  
## alcohol             1.860e-02  1.009e-02   1.844 0.065249 .  
## labelappeal-1       1.646e+00  3.857e-01   4.266 1.99e-05 ***
## labelappeal0        2.365e+00  3.833e-01   6.169 6.86e-10 ***
## labelappeal1        3.088e+00  3.890e-01   7.938 2.05e-15 ***
## labelappeal2        3.419e+00  4.426e-01   7.726 1.11e-14 ***
## acidindex           4.290e-01  2.869e-02  14.952  < 2e-16 ***
## stars1             -2.135e+00  8.357e-02 -25.548  < 2e-16 ***
## stars2             -5.679e+00  3.388e-01 -16.765  < 2e-16 ***
## stars3             -2.026e+01  3.704e+02  -0.055 0.956386    
## stars4             -2.036e+01  6.939e+02  -0.029 0.976589    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 38155671.6535 
## Number of iterations in BFGS optimization: 58 
## Log-likelihood: -1.719e+04 on 43 Df
metric      value
AIC         34473.98
Dispersion  0.45
Log-Lik     -17193.99

Diagnostics

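The Zero-Inflated Negative Binomial model sees a similar improvement to the Zero-Inflated Poisson, but as before it does not outperform its Poisson counterpart: the log-likelihood is identical (-17193.99) and the AIC is slightly higher (34473.98 vs. 34471.98).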
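
The AIC values for the four count models can be reproduced from the reported log-likelihoods using AIC = 2k - 2·logLik. In the sketch below, the parameter counts k are read from the summaries (21 coefficients for the Poisson, one more for theta in the negative binomial, and the 42 and 43 Df printed for the zero-inflated fits); the data frame layout is just for illustration.

```r
# Log-likelihoods and parameter counts as reported in the model summaries.
models <- data.frame(
  model   = c("Poisson", "ZIP", "NegBin", "ZINB"),
  log_lik = c(-19248.84, -17193.99, -19249.02, -17193.99),
  n_par   = c(21, 42, 22, 43)
)

# AIC = 2k - 2 * logLik
models$aic <- 2 * models$n_par - 2 * models$log_lik
print(models[order(models$aic), ])
```

The results agree with the reported values to within rounding of the log-likelihoods. Each negative binomial variant pays for one extra parameter without gaining any likelihood.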


Multiple Linear Regression 1

For our first Multiple Linear Regression, we’ll use all predictors.

lm1 <- lm(target ~ ., data=df_train)
## 
## Call:
## lm(formula = target ~ ., data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0032 -0.8546  0.0105  0.8407  5.5618 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.775e+00  4.828e-01   5.748 9.26e-09 ***
## fixedacidity       -6.975e-04  1.994e-03  -0.350 0.726459    
## volatileacidity    -1.045e-01  1.597e-02  -6.544 6.28e-11 ***
## citricacid          1.838e-02  1.455e-02   1.263 0.206660    
## residualsugar       2.339e-04  3.684e-04   0.635 0.525561    
## chlorides          -9.922e-02  3.915e-02  -2.534 0.011286 *  
## freesulfurdioxide   2.629e-04  8.373e-05   3.140 0.001695 ** 
## totalsulfurdioxide  2.373e-04  5.434e-05   4.367 1.27e-05 ***
## density            -7.335e-01  4.707e-01  -1.558 0.119215    
## ph                 -2.940e-02  1.841e-02  -1.597 0.110383    
## sulphates          -1.332e-02  1.329e-02  -1.002 0.316209    
## alcohol             1.176e-02  3.370e-03   3.491 0.000483 ***
## labelappeal-1       3.389e-01  6.818e-02   4.970 6.79e-07 ***
## labelappeal0        8.168e-01  6.643e-02  12.295  < 2e-16 ***
## labelappeal1        1.270e+00  6.934e-02  18.310  < 2e-16 ***
## labelappeal2        1.899e+00  9.160e-02  20.729  < 2e-16 ***
## acidindex          -1.993e-01  9.879e-03 -20.180  < 2e-16 ***
## stars1              1.398e+00  3.554e-02  39.320  < 2e-16 ***
## stars2              2.411e+00  3.457e-02  69.757  < 2e-16 ***
## stars3              2.974e+00  4.001e-02  74.339  < 2e-16 ***
## stars4              3.647e+00  6.357e-02  57.364  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.299 on 10811 degrees of freedom
## Multiple R-squared:  0.546,  Adjusted R-squared:  0.5451 
## F-statistic:   650 on 20 and 10811 DF,  p-value: < 2.2e-16
metric  value
AIC     36431.79
Adj R2  0.55

Diagnostics


Multiple Linear Regression 2

For our second Multiple Linear Regression, we’ll add stepwise feature selection.

lm2_all <- lm(target ~ ., data=df_train)
lm2 <- stepAIC(lm2_all, trace=FALSE, direction='both')
## 
## Call:
## lm(formula = target ~ volatileacidity + chlorides + freesulfurdioxide + 
##     totalsulfurdioxide + density + ph + alcohol + labelappeal + 
##     acidindex + stars, data = df_train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0357 -0.8560  0.0131  0.8396  5.5892 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.772e+00  4.825e-01   5.745 9.46e-09 ***
## volatileacidity    -1.048e-01  1.597e-02  -6.566 5.42e-11 ***
## chlorides          -1.000e-01  3.915e-02  -2.556  0.01061 *  
## freesulfurdioxide   2.633e-04  8.368e-05   3.146  0.00166 ** 
## totalsulfurdioxide  2.390e-04  5.431e-05   4.400 1.09e-05 ***
## density            -7.368e-01  4.706e-01  -1.566  0.11742    
## ph                 -2.922e-02  1.841e-02  -1.587  0.11249    
## alcohol             1.178e-02  3.368e-03   3.499  0.00047 ***
## labelappeal-1       3.392e-01  6.818e-02   4.976 6.59e-07 ***
## labelappeal0        8.168e-01  6.642e-02  12.298  < 2e-16 ***
## labelappeal1        1.270e+00  6.932e-02  18.319  < 2e-16 ***
## labelappeal2        1.899e+00  9.160e-02  20.733  < 2e-16 ***
## acidindex          -1.994e-01  9.699e-03 -20.564  < 2e-16 ***
## stars1              1.398e+00  3.554e-02  39.354  < 2e-16 ***
## stars2              2.413e+00  3.455e-02  69.840  < 2e-16 ***
## stars3              2.976e+00  3.999e-02  74.407  < 2e-16 ***
## stars4              3.649e+00  6.356e-02  57.411  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.299 on 10815 degrees of freedom
## Multiple R-squared:  0.5458, Adjusted R-squared:  0.5452 
## F-statistic: 812.3 on 16 and 10815 DF,  p-value: < 2.2e-16
metric  value
AIC     36426.96
Adj R2  0.55

Diagnostics


Model Evaluation


Predictions


Conclusion


Appendix

References

‘Total Sulfur Dioxide – Why it Matters, Too!’
Iowa State University
https://www.extension.iastate.edu/wine/total-sulfur-dioxide-why-it-matters-too/

R Code